Often, when animals encounter an unexpected sensory event, they transition from
executing a variety of movements to repeating the movement(s) that may have caused 
the event. According to a recent theory of action discovery (Redgrave and Gurney, 2006), 
repetition allows the animal to represent those movements, and the outcome, as an action 
for later recruitment. The transition from variation to repetition often follows a non-random, 
structured, pattern. While the structure of the pattern can be explained by sophisticated 
cognitive mechanisms, simpler mechanisms based on dopaminergic modulation of basal 
ganglia (BG) activity are thought to underlie action discovery (Redgrave and Gurney, 2006). 
In this paper we ask the question: can simple BG-mediated mechanisms account for 
a structured transition from variation to repetition, or are more sophisticated cognitive 
mechanisms always necessary? To address this question, we present a computational 
model of BG-mediated biasing of behavior. In our model, unlike most other models of BG 
function, the BG biases behavior through modulation of cortical response to excitation; 
many possible movements are represented by the cortical area; and excitation to the 
cortical area is topographically-organized. We subject the model to simple reaching tasks, 
inspired by behavioral studies, in which a location to which to reach must be selected. 
Locations within a target area elicit a reinforcement signal. A structured transition from 
variation to repetition emerges from simple BG-mediated biasing of cortical response 
to excitation. We show how the structured pattern influences behavior in simple and 
complicated tasks. We also present analyses that characterize the structured transition from
variation to repetition under BG-mediated biasing and under a type of cognitive biasing,
allowing us to compare the behavior resulting from these two types of biasing and to make
connections with future behavioral experiments.
1. INTRODUCTION

Animals are capable of executing a huge variety of movements 
but, importantly, they can discover the specific movements that 
affect the environment in predictable ways and represent them 
as actions for later recruitment. Redgrave, Gurney, and colleagues 
have suggested that this occurs through a process they refer to 
as action discovery (Redgrave and Gurney, 2006; Redgrave et al., 
2008, 2011, 2013; Gurney et al., 2013). Action discovery begins 
when an animal is executing movements within some context 
and an unexpected salient sensory event (such as a light flash) 
occurs. The unexpected sensory event causes a short-latency 
phasic increase in dopamine (DA) neuron activity (henceforth 
referred to simply as DA activity). Through its influence on the 
basal ganglia (BG) — a group of interconnected subcortical struc- 
tures which, in turn, influence cortical activity — the increase in 
DA activity can help bias the animal to repeat the movements that 
preceded the unexpected sensory event under the same contextual 
circumstances. This repetition bias (Redgrave and Gurney, 2006) 
allows associative networks in the brain to learn and encode the
movements as an action because it causes a frequent and reliable
presentation of context, movements, and the sensory event as the 
outcome of those movements. 

This transition from executing a variety of movements to 
repeating just one or a subset of movements often follows a non- 
random, structured, pattern. For example, consider a spatial task 
such that reaching to a specific location results in the outcome. 
Here, one type of structured transition from variation to rep- 
etition occurs if the animal gradually refines its movements so 
that movements that are further from the location decrease in 
frequency earlier than movements that are closer to the location. 

The non-random structure of the transition from variation 
to repetition can be explained with "intelligent" or sophisticated 
cognitive mechanisms, e.g., by using an estimation of the range of 
movements that cause the outcome that gets more and more pre- 
cise with repeated occurrences of the outcome. Similarly, other 
types of a structured transition may rely on other sophisticated 
notions such as optimality or uncertainty (e.g., Dearden et al. 
1998; Dimitrakakis 2006; Simsek and Barto 2006). However, the 
process of action discovery is thought to be mediated primarily
by simpler mechanisms involving DA modulation of the BG,
and not sophisticated cognitive mechanisms. In this paper we 
ask the question, can simple BG-mediated mechanisms guide a 
structured transition from variation to repetition, or must sophis- 
ticated cognitive mechanisms always be recruited? To address this 
question, we present a computational model of BG-mediated 
biasing of behavior. 

Our model will necessarily deal with a specific and, therefore, 
limited example of action discovery and so to establish its sta- 
tus, we now outline the model's wider context comprising various 
broad categories of action. For example, one type of action might 
involve making a particular gesture with the hand (as in sign lan- 
guage or hand signaling), regardless of the precise spatial location 
of the hand, and no environmental object is targeted. Another 
type of action involves manipulating objects in the environment 
(such as flipping a light switch or typing out a password). In 
this instance, space is weakly implicit (the objects are located 
somewhere); the key feature is the target object identity and its 
manipulation. In this paper, we focus on an explicitly spatial task: 
the relatively simple action of moving an end-effector to a partic- 
ular spatial location. In the model task, a movement end-point to 
which to move must be selected. End-points that correspond to 
a target location elicit a reinforcement signal, and, importantly, 
reinforcement is not contingent on movement trajectory. The 
model task is inspired by behavioral counterparts we have used 
to study action discovery in which participants manipulate a joy- 
stick to find an invisible target area in the workspace (Stafford 
et al., 2012, 2013; Thirkettle et al., 2013a,b). While there may be
"gestural" aspects of action in the behavioral task, in the model we 
ignore these and focus only on the spatial location of movement 
end-point. 

In the next few paragraphs, we describe features of neural pro- 
cessing which our model incorporates that many other models 
of the BG do not. Biological theories of BG function suggest 
that the BG bias behavior not through direct excitation of their 
efferent targets, but, rather, through the selective relaxation of 
inhibition (i.e., disinhibition) of their efferent targets (Chevalier 
and Deniau, 1990; Mink, 1996; Redgrave et al., 2011). When
the BG are presented with multiple signals, each representing an
action or movement, these signals have different activity levels
signifying the urgency or salience of the "action request." The BG
are thought to process each signal through a neural population
or channel, and inter-channel connections facilitate competitive
processes that suppress BG output (inhibition) on
high-salience channels and increase output on low-salience
channels (Gurney et al., 2001a,b; Humphries and Gurney, 2002;
Prescott et al., 2006). Many models of BG function focus on how
the multiple signals presented to the BG are transformed to the 
activity of the BG's output nucleus. Action selection in these mod- 
els is then based on the latter's activity (e.g., Gurney et al. 2001a,b, 
2004; Joel et al. 2002; Daw et al. 2005; Shah and Barto 2009). 
However, one important feature of our model is that it also takes 
into account the pattern of excitation from other areas to the BG's 
efferent targets (see also Humphries and Gurney 2002; Cohen and 
Frank 2009; Baldassarre et al. 2013). Thus, behavior results from 
BG modulation of their efferent targets' response to excitation
patterns, and is not just a mirror of the activity of the BG's output
nucleus. 

Further, many models of BG function focus on how the BG 
select from a small number of abstract independent behaviors 
(e.g., Gurney et al. 2001b; Daw et al. 2005; Cohen and Frank 2009; 
Shah and Barto 2009). While such representations may be appropriate
for some behavioral tasks in experimental psychology, in
ethological action discovery, the space of activities from which to 
select may be larger and adhere to some inherent topology. In our 
model, candidate locations to which to move are represented by a 
large number of topographically-organized neurons in cortex so 
that neighboring spatial locations are represented by neighbor- 
ing neurons. Excitation to cortex follows a pattern in which all 
neurons are weakly excited initially, and that pattern evolves so 
that eventually only one neuron is excited strongly. This pattern 
is inspired by neural activity observed in perceptual decision-making
tasks (Britten et al., 1992; Platt and Glimcher, 1999; Huk
and Shadlen, 2005; Gold and Shadlen, 2007), and is suggested by
evidence accumulation models of decision-making (Bogacz et al.,
2006; Lepora et al., 2012).

We hypothesize that, because the BG bias behavior by modulating
cortical response to excitation and because that excitation follows
a structured pattern, simple BG-mediated biasing can result in a
structured transition from variation to repetition in action discovery.
Sophisticated cognitive mechanisms are not necessarily
required to develop a structured transition.

In addition, behavioral biasing in action discovery is not 
thought to be driven by "extrinsic motivations" that are based on 
rewarding consequences and that dictate reinforcement in many 
types of operant conditioning tasks (Thorndike, 1911; Skinner, 
1938) and computational reinforcement learning (RL) (Bertsekas 
and Tsitsiklis, 1996; Sutton and Barto, 1998). Rather, "intrinsic
motivations" (Oudeyer and Kaplan, 2007; Baldassarre, 2011;
Barto, 2013; Barto et al., 2013; Gottlieb et al., 2013; Gurney
et al., 2013) that are triggered by the occurrence of an unexpected
sensory event may drive DA activity and thus behavioral biasing
in action discovery (Redgrave and Gurney, 2006; Redgrave
et al., 2008, 2011, 2013; Gurney et al., 2013; Mirolli et al., 2013).
In such cases, if the outcome does not represent or predict an
extrinsically rewarding event, reinforcement decreases as associative
networks in the brain learn to predict its occurrence
(Redgrave and Gurney, 2006; Redgrave et al., 2011). Rather than 
implement a model of prediction explicitly, we approximate its 
effects with a simple model of habituation in which the rate 
of reinforcement decreases as the target location is repeatedly 
hit (Marsland, 2009). This habituation model approximates the 
dependence of DA activity on outcome predictability in action 
discovery (Redgrave and Gurney, 2006; Redgrave et al., 2011), 
and is similar to that used in neural network models of novelty 
detection (Marsland, 2009). 

In this paper, we use computational models to demonstrate 
that simple BG-mediated mechanisms can bias behavior, via their 
modulation of cortical response to a pattern of excitation, such 
that the transition from variation to repetition follows a structured
pattern. We describe this structured pattern and show how
it, along with the effects of habituation, leads to behavioral patterns
in tasks in which one target area delivers a reinforcement
signal, two target areas deliver reinforcement, or the target area
that delivers reinforcement changes location. These experiments 
lead to predictions as to the type of behavior that would be 
expected when only simple BG-mediated mechanisms, and not 
more sophisticated cognitive mechanisms, bias behavior. We also 
run models that mimic a simple form of transition from variation 
to repetition that would be expected under sophisticated cogni- 
tive mechanisms by subsuming the effects of those mechanisms 
in a phenomenological way. In order to make contact with future 
behavioral experiments, we develop a novel characterization of 
behavioral trends which links these trends to underlying neural 
mechanisms that dictate different forms of biasing. 

2. METHODS 

We use a computational model, based on established models 
(Gurney et al., 2001a,b; Humphries and Gurney, 2002), to control 
movement selection in a task that simulates reaching or pointing 
to specific target spatial locations. We provide here a conceptual 
overview of its mechanics; detailed equations are provided in the 
Supplementary section. 

The model is a neural network model with leaky-integrator 
neuron units (henceforth referred to as "neurons" for brevity), 
the activities of which represent conglomerate neural firing rate 
of a group of neurons (Gurney et al., 2001a,b). Each brain area 
in the model, except for the area labeled "Context," consists of 
196 neurons spatially arranged in a 14 × 14 grid. Each neuron in
each area is part of an "action channel" (Gurney et al., 2001a,b;
Humphries and Gurney, 2002) such that its location in the grid 
corresponds to a movement toward the corresponding location of 
a two-dimensional workspace. For the purposes of this model, the 
workspace is of dimensions 14 × 14 units. Most projections from
one area to another are one-to-one and not plastic; exceptions will 
be explicitly noted. 

Figure 1 illustrates the gross architecture of the model. In brief, 
the end-point location of a movement, $X_M$, is determined by the
activities of neurons in "M (Cortex)." These neurons are excited 
by an exploratory mechanism, "E (Explorer)," and are engaged 
in positive feedback loops with neurons in "T (Thalamus)." 
The basal ganglia (BG, gray boxes) send inhibitory projec- 
tions to Thalamus neurons, and they modulate the gain of the 
Cortex-Thalamus positive feedback loops (Chambers et al., 2011)
through selective disinhibition of Thalamus neurons. Cortex and 
Thalamus represent grids of neurons that correspond to motor- 
related areas of cortex and thalamus, respectively. 

2.1. EXCITATORY INPUTS TO THE NEURAL NETWORK 

There are two sources of excitatory input to the neural network.
The first is labeled "C (Context)" and represents the context,
such as participating in the current experiment. There is only one 
context for the results reported in this paper. Thus, Context con- 
sists of a single neuron with an output activity set to a constant 
value. Context influences BG activity through one-to-all projections
to areas D1, D2, and STN. Projections to D1 and D2 are
plastic and represent a context-dependent biasing of movements, 
as described in the subsection "Biasing of behavior." 

The second source of excitatory input is "E (Explorer)," which 
provides excitation to Cortex which, in turn, is responsible for
movement. The Explorer is the source of variation required to
explore the space of possible movements. This variation may be 
more or less random or structured according to the strategy used. 
However, these strategies are devised by other mechanisms, not 
explicitly modeled here, and we simply aim to capture the effects 
of such strategies in the Explorer. 

In this paper, the Explorer is inspired by a range of experi- 
mental data. First, recordings in some areas of parietal cortices 
(Anderson and Buneo, 2002) show activation of neurons corre- 
sponding to a decision to make a movement that terminates at 
the location represented by those neurons. Further, several experimental
studies (Britten et al., 1992; Platt and Glimcher, 1999;
Huk and Shadlen, 2005; Gold and Shadlen, 2007) show that neurons
representing different decisions are weakly active early in
the decision-making process. The activities of some neurons — 
corresponding to the executed decision in these experiments — 
increase at a greater rate than that of other neurons. 

We capture features of this behavior with a hand-crafted func- 
tion describing, for a decision to move to a particular spatial loca- 
tion, the evolution of activity for every neuron in the Explorer. 
Early in the process, all neurons are weakly-excited with low acti- 
vation levels. Neural activity evolves such that, as confidence in a 
particular movement increases, so does the corresponding neuron 
activity. The activities of other neurons increase to a lesser degree. 
An example of this behavior is shown in Figure 2; it is described 
in greater detail in the next paragraph and in the Supplementary 
section. 

For each movement, a particular neuron in Explorer, labeled
$G_{exp}$, is chosen. If we suppose that sophisticated cognitive mechanisms
are not devoted to movement selection, $G_{exp}$ is chosen
randomly. The activity of the neuron corresponding to $G_{exp}$
increases linearly to one (green line in Figure 2). The activities of
surrounding neurons change according to a Gaussian-like function
centered at $G_{exp}$. They first increase and then decrease; those
furthest from $G_{exp}$ increase by a small amount and then quickly
decrease to zero, while those closer to $G_{exp}$ increase by a larger
amount and decrease at a later time point to zero. The pattern of
activity in which the activity of neuron $G_{exp}$ is one and the activities
of all other neurons are zero is held for a brief time, and then
the activities of all neurons are set to zero. This evolution takes $T_E$
time steps, which is the number of time steps in a trial.
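
The qualitative behavior of the Explorer described above can be illustrated with a minimal sketch. The functional form below is an assumption chosen only to reproduce the stated properties (a linear ramp at the focus, and a Gaussian-like spatial profile whose width shrinks to zero); it is not the function used in the model, which is given in the Supplementary section. The name `explorer_activity` and the parameters `n_ramp`, `n_hold`, and `sigma0` are hypothetical.

```python
import numpy as np

GRID = 14  # Explorer is a 14 x 14 grid of neurons (from the text)


def explorer_activity(focus, n_ramp=80, n_hold=20, sigma0=4.0):
    """Illustrative Explorer activity for one trial (assumed functional form).

    focus  : (row, col) of the focus of excitation, G_exp
    n_ramp : hypothetical number of steps over which the focus ramps to one
    n_hold : hypothetical number of steps the final pattern is held
    sigma0 : hypothetical initial width of the Gaussian-like profile
    Returns an array of shape (n_ramp + n_hold + 1, GRID, GRID).
    """
    rows, cols = np.mgrid[0:GRID, 0:GRID]
    dist2 = (rows - focus[0]) ** 2 + (cols - focus[1]) ** 2
    steps = []
    for t in range(1, n_ramp + 1):
        s = t / n_ramp                  # focus activity: linear ramp up to one
        sigma = sigma0 * (1.0 - s)      # spatial width shrinks toward zero
        if sigma > 0:
            spatial = np.exp(-dist2 / (2.0 * sigma ** 2))
        else:
            spatial = (dist2 == 0).astype(float)  # only the focus remains
        steps.append(s * spatial)
    steps.extend([steps[-1]] * n_hold)        # hold the final pattern briefly
    steps.append(np.zeros((GRID, GRID)))      # then set all activities to zero
    return np.stack(steps)


activity = explorer_activity(focus=(7, 7))
near, far = activity[:, 7, 8], activity[:, 7, 12]
print(near.max() > far.max(), near.argmax() > far.argmax())
```

Under these assumptions, a neuron close to the focus reaches a higher peak and peaks later than a distant neuron, which matches the verbal description; the total length of the returned array plays the role of $T_E$.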

If, in contrast, we assume sophisticated cognitive mechanisms 
do influence movement selection, $G_{exp}$ is chosen in order to
reflect that strategy, e.g., according to some heuristic search such 
as a spiral pattern or quadrant-by-quadrant search. In this paper 
we examine behavior that results when cognitive mechanisms do 
not influence movement selection as well as behavior that results 
from a simple pattern, as described in the subsection "Biasing of 
behavior." 

2.2. CORTEX AND THALAMUS 

"Cortex" represents cortical areas that encode high-level move- 
ment plans such as reaching or pointing to a location (Anderson 
and Buneo, 2002). In our model, the spatial location of a 
neuron in Cortex corresponds to a target spatial location in 
the workspace, or movement end-point, to which to reach. 
Cortex (M) receives excitatory projections from Explorer and
Thalamus (T) which preserve channel identity; that is, the neurons
representing a given channel in Explorer and Thalamus
project to the corresponding neuron in Cortex. In turn, Thalamus
receives channel-wise excitatory projections from Cortex, and
channel-wise inhibitory projections from SNr (a nucleus of
the BG called the substantia nigra pars reticulata). Cortex and
Thalamus therefore form a positive feedback loop, referred to as
a Cortex-Thalamus loop, for each channel, which is excited by
the corresponding channel in Explorer. The gain of a Cortex-Thalamus
loop is modulated by inhibitory projections from SNr
neurons to Thalamus (Chambers et al., 2011). When the activity
level of an SNr channel is low, the corresponding Thalamus neuron
is said to be disinhibited and its Cortex-Thalamus loop has
a high gain. A Cortex-Thalamus loop with a high gain is more
easily excited by the corresponding Explorer neuron.

FIGURE 1 | Architecture of the model. Each box except for "C (Context)" contains 196 neurons spatially arranged in a 14 × 14 grid. Context contains just one neuron. Types of projections are labeled in the legend on the right.

FIGURE 2 | Example of activity of Explorer neurons during a typical movement. The activity of the neuron corresponding to the focus of excitation, $G_{exp}$, is drawn in green. Selected neurons, colored in the inset, are drawn with thick lines in different shades of gray so as to demonstrate the spatial influence on excitation pattern. All other neurons are drawn in thin gray lines. (Axis label: time step.)
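
To make the idea of gain modulation concrete, the following is a minimal sketch of a single Cortex-Thalamus channel built from two leaky-integrator units. It is not the model's implementation: the update equations, the piecewise-linear output, and all parameter values (`w_loop`, `w_snr`, `dt`, `n_steps`) are assumptions made for illustration, and the full equations are in the Supplementary section.

```python
def cortex_steady_state(explorer, snr, w_loop=0.8, w_snr=1.0,
                        dt=0.1, n_steps=2000):
    """Approximate steady-state Cortex output for one channel (illustrative).

    explorer : constant excitation from the Explorer neuron of this channel
    snr      : constant activity of the SNr neuron of this channel
    All weights and the Euler step size are hypothetical values.
    """
    def output(a):
        # Piecewise-linear output bounded between 0 and 1 (assumed).
        return max(0.0, min(1.0, a))

    a_cortex = 0.0
    a_thal = 0.0
    for _ in range(n_steps):
        y_cortex, y_thal = output(a_cortex), output(a_thal)
        # Cortex is excited by Explorer and by Thalamus (positive feedback).
        a_cortex += dt * (-a_cortex + explorer + w_loop * y_thal)
        # Thalamus is excited by Cortex and inhibited by SNr.
        a_thal += dt * (-a_thal + w_loop * y_cortex - w_snr * snr)
    return output(a_cortex)


disinhibited = cortex_steady_state(explorer=0.3, snr=0.0)
inhibited = cortex_steady_state(explorer=0.3, snr=0.6)
print(disinhibited > inhibited)
```

With these assumed values, the same Explorer input produces a larger Cortex response when SNr activity is low, because the feedback through Thalamus is intact; when SNr activity is high, Thalamus is silenced and Cortex simply follows its Explorer input.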



2.3. BASAL GANGLIA 

The functional properties of BG architecture have been described 
in detail in prior work (Gurney et al., 2001a,b; Humphries and 
Gurney, 2002; Redgrave et al., 2011). Briefly, the BG is a sub- 
cortical group of brain areas with intrinsic architecture that is 
well-suited to select one behavioral option among competing 
options. The BG implement an off-center on-surround excitation
pattern: the BG channel $i$ that is most strongly excited by its cortical
"action request" inhibits the corresponding target channel
(neuron) in Thalamus the least, while other Thalamus channels
$j \neq i$ are further inhibited. Thus, Cortex-Thalamus loop $i$ is
most easily excited by input from Explorer to Cortex, and other
Cortex-Thalamus loops $j \neq i$ are harder to excite by input from
Explorer to Cortex. These properties are similar in some ways
to those of a winner-take-all network between the competing 
channels, but additional architectural features of the BG ensure
better control of the balance between excitation and inhibition
(Gurney et al., 2001a,b). D1 and D2 refer to different populations
of neurons (named after the dopamine receptors they
predominantly express) in a nucleus of the BG called the striatum.
The pathway comprising D1 and STN (subthalamic nucleus)
performs the selection with an off-center on-surround network in
which D1 supplies focussed ("central") inhibition and the STN a
diffuse ("surround") excitation. The pathway through D2 regulates
the selection by controlling, through GPe (external segment
of the globus pallidus), the excitatory activity of STN (Gurney
et al., 2001a,b).

2.4. FROM CORTICAL ACTIVITY TO BEHAVIOR 

Movement in this model is a function of the activities of the 
Cortex neurons. Each neuron with an activation greater than a
threshold $\eta$ "votes" to move to the location represented by its
grid location with a strength proportional to its activity (i.e.,
using a population code, Georgopoulos et al. 1982). In most cases,
because of the selection properties of the BG, the activation of
only one Cortex neuron rises above $\eta$. At each time step $t$, the
target location to which to move, $X_M(t)$, is an average of the
locations represented by Cortex neurons with activities above $\eta$,
weighted by their activities. At each $t$, if any Cortex neuron is
above $\eta$, a simple "motor plant" causes a movement from the current
position ($X_P(t)$) toward $X_M(t)$ (see Supplementary section
for equations). Movement evaluation, and hence any learning, is
based only on $X_P(T_E)$, the position at time $T_E$ (the last time step of
a trial). Thus, end-point of movement, not movement trajectory,
is evaluated in this model.
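
The activity-weighted vote that determines the target location can be sketched as follows. This is a minimal illustration of the population-code average described above; the function name `movement_target` and the threshold value `eta=0.5` are assumptions, since the value of the threshold is not stated in this section, and the motor plant is not included.

```python
import numpy as np


def movement_target(cortex, eta=0.5):
    """Activity-weighted average of the locations of supra-threshold neurons.

    cortex : 2-D array of Cortex activities; index (row, col) is the
             workspace location represented by that neuron
    eta    : voting threshold (hypothetical value)
    Returns the target location as a length-2 array, or None if no
    neuron is above threshold (in which case no movement is made).
    """
    rows, cols = np.nonzero(cortex > eta)
    if rows.size == 0:
        return None
    weights = cortex[rows, cols]
    return np.array([np.sum(weights * rows),
                     np.sum(weights * cols)]) / np.sum(weights)


cortex = np.zeros((14, 14))
cortex[3, 3] = 0.6
cortex[3, 5] = 0.9
print(movement_target(cortex))  # lies between the two voters, nearer (3, 5)
```

When only one neuron is above threshold, which is the usual case according to the text, the function returns exactly that neuron's location.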

2.5. BIASING OF BEHAVIOR 

Targets are circular areas within the workspace. A target is considered
hit when $\lVert X_P(T_E) - X_G \rVert < \theta_G$, where $X_G$ is the location of
the center of target $G$ and $\theta_G$ (= 1.1) is the radius. Thus, a movement
to the location represented by neuron $i$ that corresponds to
the center of the target, or to locations represented by the immediate
four neighboring neurons, is within the target's radius. When
a target is hit, behavior is biased so that the model is more likely to 
make movements to the target. This repetition bias (Redgrave and 
Gurney, 2006) can be implemented in two ways in this model. 
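
The hit criterion can be checked directly. The sketch below assumes that neurons sit at integer grid coordinates spaced one workspace unit apart and uses a hypothetical target center at (10, 10); the function name `target_hit` is ours. It confirms that, with a radius of 1.1, exactly the center and its four immediate neighbors count as hits, while diagonal neighbors (at distance of about 1.41) do not.

```python
import numpy as np

THETA_G = 1.1  # target radius, from the text


def target_hit(end_point, target_center, radius=THETA_G):
    """True if the movement end-point lies strictly within the target radius."""
    diff = np.asarray(end_point, dtype=float) - np.asarray(target_center, dtype=float)
    return float(np.linalg.norm(diff)) < radius


center = (10, 10)  # hypothetical target center on the 14 x 14 grid
hits = [(r, c) for r in range(14) for c in range(14)
        if target_hit((r, c), center)]
print(hits)
```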

The first way is "BG-mediated biasing," which is based on 
dopamine-dependent plasticity at the corticostriatal synapses 
(Calabresi et al., 2007; Wickens, 2009), and is implemented as 
a Hebbian-like rule governing plasticity to weights onto striatal
D1 and D2 neurons. When the end-point of movement is evaluated
(at time $T_E$ of a trial), usually only one neuron ($i$) in each
of Cortex, D1, and D2 has an activity above zero. If the target is
hit, the weights from Cortex neuron $i$ to D1 neuron $i$, Cortex neuron
$i$ to D2 neuron $i$, the Context neuron to D1 neuron $i$, and the
Context neuron to D2 neuron $i$ are increased according to equations
of the following form (see Supplementary section for full
equations):

$$\Delta w_i = \alpha \, \beta^{N_k - 1} \, y^{pre} \, y^{post} \, (W^{max} - w_i), \qquad (1)$$

where $w_i$ is the weight, $y^{pre}$ is the activity of the presynaptic
neuron, $y^{post}$ is the activity of the postsynaptic neuron, $\alpha$ is a
step-size, $W^{max}$ (= 1) is the maximum strength of a synapse, $\beta$
(= 0.825) is a habituation term (Marsland, 2009), and $N_k$ is the
number of times target $k$ has been hit. If the target is not hit, the
weights are decreased. Weights from Cortex to striatum have a
lower limit of zero, while weights from Context to striatum have a
lower limit of -0.1. Neurons that have greater afferent weights are
more easily excited than are neurons with lower afferent weights.
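
A short sketch may help show how Equation 1 combines Hebbian growth, saturation at the maximum weight, and habituation. The values of the maximum weight and the habituation term are taken from the text; the step-size `alpha=0.1` is a hypothetical value, and we assume that $N_k$ counts the current hit, so that the habituation factor is one on the first hit. The rule for decreasing weights after a miss is not given in this section and is not sketched here.

```python
W_MAX = 1.0   # maximum synaptic strength (from the text)
BETA = 0.825  # habituation term (from the text)


def weight_increment(w, y_pre, y_post, n_hits, alpha=0.1):
    """Weight change of the form of Equation 1 after a target hit.

    w      : current weight
    y_pre  : presynaptic activity
    y_post : postsynaptic activity
    n_hits : number of times the target has been hit, assumed to
             include the current hit (so the first hit gives BETA**0 = 1)
    alpha  : step-size (hypothetical value)
    """
    return alpha * BETA ** (n_hits - 1) * y_pre * y_post * (W_MAX - w)


# Repeated hits with full pre- and postsynaptic activity: the weight
# grows, but each increment is smaller than the last.
w = 0.0
increments = []
for n in range(1, 11):
    dw = weight_increment(w, y_pre=1.0, y_post=1.0, n_hits=n)
    increments.append(dw)
    w += dw
print([round(dw, 4) for dw in increments], round(w, 4))
```

Two effects shrink the increments: the factor `(W_MAX - w)` falls as the weight approaches its ceiling, and the habituation factor decays geometrically with the number of hits, so reinforcement fades as the outcome becomes familiar.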

Neurons in D1 and D2 that correspond to movements that
were reinforced are excited by the Context neuron from the
first time step of a trial onward, and neurons that correspond
to movements that were not reinforced are weakly inhibited
by the Context neuron. (We use negative weights to approximate
the inhibitory effects of striatal interneurons, Koos and
Tepper 1999; Bolam et al. 2006). Thus, weights from the Context
neuron to D1 and D2 represent an a priori bias in favor of
movements that were reinforced, and against movements that
were not reinforced. This bias is context-dependent and, while
there is only one context for the results reported in this paper,
multiple contexts can be represented by multiple context neurons
with similar learning rules. Neurons in D1 and D2 are
also excited by Cortex neurons, which, early in a trial, are all
weakly excited by Explorer. Because the projections from Cortex
to D1 and D2 are plastic, movements that were reinforced are
more easily excited by Cortex than movements that were not
reinforced.

Thus, with BG-mediated biasing, channels corresponding to 
making a movement to locations that are within the target area 
are easily-excited by weak inputs from the Explorer after the target 
has been hit several times. Channels corresponding to move- 
ments that do not hit the target are made to be more difficult to 
excite. 

The second way by which repetition bias is implemented in this
model is referred to as "Cognitive biasing," whereby $G_{exp}$ is chosen
according to some strategy or pattern. Under cognitive biasing
in this paper, the set of neurons in Explorer from which $G_{exp}$ is
chosen corresponds to a spatial area, centered around the location
of the target, that decreases in size each time the target is hit (we
describe this pattern in detail in the Supplementary section). This
is a simple hand-crafted form of biasing that mimics a decrease in
variation and increase in repetition by "zooming in" on the target
as the target is repeatedly hit. It is meant to capture the effects
of behavioral biasing as mediated by "sophisticated cognitive" or
"intelligent" mechanisms. If there is no Cognitive biasing, $G_{exp}$ is
randomly chosen as described earlier.
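
The "zooming in" idea can be sketched as a candidate region that shrinks with each hit. The exact pattern used in the model is described in the Supplementary section; the circular region, the geometric shrink schedule, and the parameters `r0`, `shrink`, and `r_min` below are assumptions made purely for illustration, as are the function names.

```python
import numpy as np

GRID = 14  # Explorer is a 14 x 14 grid of neurons (from the text)


def candidate_neurons(target, n_hits, r0=19.0, shrink=0.8, r_min=1.1):
    """Grid locations from which the focus of excitation may be drawn.

    target : (row, col) of the target center
    n_hits : number of times the target has been hit so far
    r0     : hypothetical initial radius, large enough to cover the grid
    shrink : hypothetical factor applied to the radius after every hit
    r_min  : hypothetical floor on the radius
    """
    radius = max(r_min, r0 * shrink ** n_hits)
    rows, cols = np.mgrid[0:GRID, 0:GRID]
    inside = (rows - target[0]) ** 2 + (cols - target[1]) ** 2 <= radius ** 2
    return np.argwhere(inside)


def choose_focus(target, n_hits, rng):
    """Draw the focus of excitation uniformly from the current candidates."""
    candidates = candidate_neurons(target, n_hits)
    row, col = candidates[rng.integers(len(candidates))]
    return int(row), int(col)


rng = np.random.default_rng(0)
sizes = [len(candidate_neurons((10, 10), n)) for n in range(0, 20)]
print(sizes)
print(choose_focus((10, 10), n_hits=5, rng=rng))
```

With no hits, every neuron is a candidate, which corresponds to purely random selection; after many hits under these assumed parameters, only the target center and its four immediate neighbors remain.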

2.6. MODEL EXPERIMENTS 

A model run consists of having the model select movements for 
300 trials (where a trial consists of executing one movement). 
Movements were reinforced (Equation 1) when they hit a particular
target. We examined behavior that results from reinforcing
one target, two targets simultaneously, and one target and then
another. The targets are referred to as $G_1$, $G_{2far}$ (which is far from
$G_1$), and $G_{2near}$ (which is near $G_1$). Experiments 1 to 4 were
conducted to describe patterns of behavior under simple, "non-intelligent,"
BG-mediated biasing and different conditions of
reinforcement. Experiment 5 was conducted to describe patterns
of behavior under BG biasing, Cognitive biasing, and both.
• Experiment 1: Single target ($G_1$): We ran 50 independent runs
of 300 movements during which BG biasing (and not Cognitive
biasing) was used to reinforce movements that hit $G_1$.

• Experiment 2: Two simultaneous targets ($G_1$ and $G_{2far}$): We
ran 50 independent runs of 300 movements during which
BG biasing was used to reinforce movements that hit either
$G_1$ or $G_{2far}$.

• Experiment 3: Reinforce $G_1$, then $G_{2far}$, then $G_1$ again: We ran
50 independent runs of 900 movements during which BG biasing
was used to reinforce movements that hit $G_1$ for the first
300 movements, then to reinforce movements that hit $G_{2far}$
(but not those that hit $G_1$) for the next 300 movements, and
then to reinforce movements that hit $G_1$ (but not those that hit
$G_{2far}$) for the final 300 movements.

• Experiment 4: Reinforce $G_1$, then either $G_{2far}$ or $G_{2near}$: We
ran 50 independent runs of 600 movements during which BG
biasing was used to reinforce movements that hit $G_1$ for the
first 300 movements and then to reinforce movements that hit
$G_{2far}$ (but not those that hit $G_1$) for the next 300 movements.
We ran another 50 independent runs of 600 movements during
which BG biasing was used to reinforce movements that hit $G_1$
for the first 300 movements and then to reinforce movements
that hit $G_{2near}$ (but not those that hit $G_1$) for the next 300
movements.

• Experiment 5: Different bias conditions: We ran 50 independent
runs of 300 movements during which Cognitive biasing
(and not BG biasing) was used to reinforce movements that hit
$G_1$. We ran another 50 independent runs of 300 movements
during which both BG biasing and Cognitive biasing were used
to reinforce movements that hit $G_1$.
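
The reinforcement schedules of the five experiments listed above can be summarized compactly in code. This is only a restatement of the list as a lookup function; the function name, the string labels for the targets, and the zero-based trial index are our own conventions, and target coordinates are deliberately omitted because they are not specified in this section.

```python
N_TRIALS = {1: 300, 2: 300, 3: 900, 4: 600, 5: 300}  # movements per run
BLOCK = 300                                          # movements per block


def reinforced_targets(experiment, trial, second_target="G2far"):
    """Set of target labels that are reinforced on a given trial.

    experiment    : experiment number, 1 to 5
    trial         : zero-based movement index within the run
    second_target : "G2far" or "G2near"; used only by Experiment 4
    """
    if experiment not in N_TRIALS:
        raise ValueError(f"unknown experiment: {experiment}")
    if not 0 <= trial < N_TRIALS[experiment]:
        raise ValueError(f"trial {trial} out of range")
    if experiment in (1, 5):
        return {"G1"}
    if experiment == 2:
        return {"G1", "G2far"}
    if experiment == 3:
        # G1 for the first and third blocks, G2far for the second.
        return {"G2far"} if trial // BLOCK == 1 else {"G1"}
    # Experiment 4: G1 for the first block, then the second target only.
    return {"G1"} if trial < BLOCK else {second_target}


print(reinforced_targets(3, 299), reinforced_targets(3, 300),
      reinforced_targets(3, 600))
```

Experiments 1 and 5 share the same schedule; they differ in the type of biasing applied (BG biasing, Cognitive biasing, or both), which would be a separate setting in any simulation built on this schedule.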

3. RESULTS 

3.1. EXPERIMENT 1: SINGLE TARGET ($G_1$)

Recall that there are two sources of excitation to the model, as
explained in Methods section 2.1: the Context neuron, which
projects to D1, D2, and STN; and the Explorer, which projects
to Cortex (see also Figure 1). As described in Methods section
2.1, a focus of excitation, $G_{exp}$, is chosen randomly, and
the activities of neurons in the Explorer follow a hand-crafted
pattern such that all neurons are weakly excited initially, but
that activity focuses so that only the neuron corresponding
to $G_{exp}$ is strongly excited (see Figure 2). If the weights onto
D1 and D2 remain at their initial values, Explorer activity
will result in a movement made to the location represented
by $G_{exp}$.

In Experiment 1, there was a single target, $G_1$, located in
the lower right area of the workspace (center of target colored
in red in the upper left graph in Figure 3). When the
target was first hit, it was because the Explorer happened to
choose a $G_{exp}$ that was within $\theta_G$ of the target center. As described
in Methods section 2.5, when the target is hit, the corticostriatal
weights that project to striatal neurons corresponding
to the movement just made are increased (Equation 1).
When a target is not hit, the weights decrease. The weight
change influences how the BG modulate the gain of the
Cortex-Thalamus positive feedback loops (Methods sections
2.2 and 2.3), and hence how Cortex responds to excitation from
Explorer.



Neural activity 

Figure 3 shows selected neuron activity resulting from the
same excitation from the Explorer during early movements
("before learning") and during late movements ("after learning").
Excitation from Explorer is illustrated in the lower left graph,
and the color scheme indicating which neuron's activity is plotted
is illustrated in the upper left graph. In this example, activities
of neurons corresponding to movements made to $G_{exp}$ are plotted
in green; those corresponding to the center of the target ($G_1$)
are plotted in red; and those corresponding to a subset of neurons
near or between $G_{exp}$ and $G_1$ are plotted in shades of gray.
(Compare with Figure 2 and Methods section 2.1.) $G_{exp}$ is not
within the target area. The top row of graphs to the right of the
color scheme graph plots neuron activity in striatum D1, neuron
activity in SNr, and neuron activity in Cortex in the untrained
model. As excitation from Explorer evolved over time, the activities
of Cortex neurons increased accordingly due to the direct one-to-one
projections from Explorer to Cortex and positive feedback loops
with Thalamus (as described in Methods section 2.2). Cortex
activity directly excited striatal neurons due to direct one-to-one
projections to striatum D1 and striatum D2 (as described in
Methods section 2.3). In this case, striatal neurons corresponding
to $G_{exp}$ increased in activity. Because no learning had occurred
yet, Context did not bias activity in striatum as all projections
from Context to striatum remained at zero. Intra-BG processing
(described in Methods section 2.3) resulted in a decrease in
activity of the SNr neuron corresponding to $G_{exp}$, and an increase
in the activity of all other SNr neurons. This disinhibited the
Thalamus neuron corresponding to $G_{exp}$, increasing the gain on the positive
feedback loop with the Cortex neuron corresponding to $G_{exp}$, thus
allowing it to increase in activity even more. In addition, the
increased activity of all other SNr neurons further decreased the
positive feedback gain between other Cortex-Thalamus neuron
pairs (Chambers et al., 2011). In this example, weights into D1
and D2 had not undergone any changes, i.e., the target had not
been hit, so there was no biasing from Context. Thus, the BG facilitated
the selection of the movement suggested by Explorer (move
to location $G_{exp}$) and inhibited the selection of other movements.

After the target had been hit many times, the weights from
Context to striatal neurons D1 and D2, and from Cortex to D1
and D2, that correspond to movements made to a location within
the target zone (in this example, the center of G_1) increased
(as described in Methods section 2.5 and Equation 1), and the
weights to all others decreased by a small amount. Neuron activity
in response to the same excitation from Explorer after learning
is illustrated in the bottom, rightmost three graphs of Figure 3.
Neurons that correspond to G_1 (plotted in red) are referred to
as s_G. Because weights from Context to s_G in D1 and D2 have
increased, the activity of neuron s_G in D1 and D2 increased faster
due to excitation from Cortex than did that of other neurons,
including that of neurons that correspond to movements made
to G_exp. This caused a decrease in the activity of SNr neuron
s_G and an increase in the gain of the corresponding
Cortex-Thalamus positive feedback loop (described in Methods
section 2.2). Hence, the weak excitation to Cortex neuron s_G at
the beginning of a movement period was sufficient to initiate a
positive feedback process between the corresponding neuron s_G in
Cortex and Thalamus, causing more excitation to neuron s_G in D1
and D2, even further disinhibition of the feedback loop, and further
inhibition of the loops of other neurons. BG-mediated bias was
in favor of movements toward G_1, implemented by an increase in
weights from Context and Cortex to the neurons in D1 and D2
that correspond to a movement to G_1 (Equation 1). Thus, the
activity of Cortex neuron s_G increased above η and a movement was
made to the location corresponding to G_1, even though the Explorer
more strongly excited neurons corresponding to movements made
to G_exp.

FIGURE 3 | Example neural activity for selected neurons from D1, SNr,
and Cortex (M) at different points in training. Neurons are colored
according to their spatial location in the grid (top left). The red neuron
corresponds to the center of target G_1; the green neuron corresponds to
the location of G_exp (focus of excitation in Explorer), which is not within
the target area. Other neurons, most of which are located in between
G_1 and G_exp, are colored in gray (darker gray neurons are closer to G_exp).
Bottom left: activities of Explorer neurons. Rightmost three graphs on
top: activities of neurons from D1, SNr, and M before learning (i.e., before
the target was hit). Rightmost three graphs on bottom: activities of the
same neurons after learning, i.e., after the target was hit several times.
Note that the maximum of the vertical axis of SNr is 0.5, while that of the
other graphs is one. Horizontal dashed line of graphs for M (right)
represents η.

Movement redistribution under contextual bias 

The biasing of activity within the BG, the BG's regulation of
Cortex-Thalamus loop excitability, and the gradual focusing of
excitation from Explorer to Cortex comprise simple mechanisms that
result in a seemingly "intelligent" structured transition from
variability to repetition. After the target had been hit by chance a
few times, weights from Context to neurons s_G in D1 and D2, and
weights from neuron s_G in Cortex to neurons s_G in D1 and D2,
were increased a little (Equation 1). When Explorer later chose
a G_exp near G_1, the resulting relatively high excitation to Cortex
neuron s_G, combined with the increased gain at Cortex-Thalamus
loop s_G and decreased gain to other loops, excited Cortex neuron
s_G while preventing other Cortex neurons from increasing
past η. Thus, a movement to the target was made when Explorer
chose a G_exp near G_1: the target was hit with an increased
likelihood, and movements to areas near the target were made with
a decreased likelihood. We refer to this pattern as a "bias zone,"
centered at G_1, that increases in size the more often the target
is hit.



Figure 4 shows how the bias zone increases as the number of
times the target has been hit increases. In order to produce this
figure, the model was run with G_exp set to G_1 for a set number
of times. Then, learning was turned off and model response for
G_exp set to each possible location was examined. Each graph in
Figure 4 plots the location of G_exp in the workspace: green dots
indicate locations of G_exp that result in movements made to those
locations; red dots indicate locations of G_exp that result in
movements made to locations within the target area (red circle). The
title of each graph indicates how many times G_exp was set to G_1
before response was examined. The expansion of the bias zone
determines an "intelligent-looking" structured transition from
variation to repetition in that it follows a non-random pattern.

For the purposes of this paper, model behavior is considered 
to be well-learned when a "streak" of hitting the target with ten 
consecutive movements is achieved. Figure 5, top left, plots the 
proportion of 50 runs that achieved this streak by various points 
of experience. About 40% reached it by 100 movements, and 
almost 80% reached it by 300 movements. A little over 20% did 
not achieve it by 300 movements. Figure 5, bottom left, plots the 
proportion of 50 runs that hit the target as a function of move- 
ment number. The proportion reaches about 0.8 by movement 
number 300. 
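
The streak criterion above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names and the representation of a run as a list of booleans (one entry per movement, True if the target was hit) are assumptions, not part of the model described in this paper.

```python
from typing import Optional, Sequence


def first_streak_end(hits: Sequence[bool], streak_len: int = 10) -> Optional[int]:
    """Return the 1-based movement number at which a streak of `streak_len`
    consecutive target hits is first completed, or None if it never is."""
    run = 0
    for movement_number, hit in enumerate(hits, start=1):
        run = run + 1 if hit else 0  # a miss resets the streak counter
        if run >= streak_len:
            return movement_number
    return None


def proportion_reaching_streak(runs: Sequence[Sequence[bool]],
                               by_movement: int,
                               streak_len: int = 10) -> float:
    """Fraction of runs whose first streak is completed by `by_movement`.
    Evaluating this at several values gives a cumulative bar graph."""
    reached = 0
    for hits in runs:
        end = first_streak_end(hits, streak_len)
        if end is not None and end <= by_movement:
            reached += 1
    return reached / len(runs)
```

Calling `proportion_reaching_streak` at 50, 100, 150, and 300 movements would give cumulative proportions of the kind plotted in Figure 5, top left, provided each run is stored as a hit/miss sequence.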

Figure 5, right, plots, for each movement across the 50 runs,
the distance between the movement and G_1 as a function of
movement number. The distances of movements that hit G_1 are plotted
in red (and are all at zero). As movement number increases, the
density of movements near G_1 that did not hit G_1 decreases
at a faster rate than the density of movements far from G_1. This
pattern is due to the expanding bias zone (Figure 4). We develop
a method for quantifying this pattern in the section describing
results of Experiment 5 (and in the Supplementary section).
Experiment 4 describes behavior in a more complicated task that
results from this pattern.

FIGURE 4 | Illustration of the "bias zone" effect. In each graph, the target
was first hit N times (labeled at the top of the graph). Then, learning was
turned off and movement for each possible value of G_exp was evaluated. Each
dot represents the spatial location corresponding to G_exp. Large green dots
represent locations of G_exp that resulted in movements that hit the location
corresponding to G_exp. Small red dots represent locations of G_exp that, because
of biasing implemented by weights onto D1 and D2, resulted in movements
that hit the target (represented by the red circle in the lower right).

FIGURE 5 | Performance across all 50 runs for Experiment 1: A single
target (and only BG biasing). Top left: proportion of runs that achieved
a streak of hitting the target ten consecutive times by movement
50, 100, 150, or 300. Note that the bar graphs are cumulative. Bottom
left: proportion of runs that hit the target as a function of movement number.
Right: Distance from the center of the target (in units of target radius) of
each movement from all 50 runs. Distances for movements that hit the target
are drawn in red and are at value 0 of the vertical axis.

Effect of cortical noise on model performance 

The capability of the model to bias movements toward G_1
is due in part to the pattern of excitation from Explorer to
Cortex (Figure 2), which weakly excites all Cortex neurons by
very similar amounts early in a trial. This suggests that model
performance may be sensitive to unpredicted deviations from
this pattern. To investigate this, we ran simulations in which
signal-dependent noise (Harris and Wolpert, 1998) was added
to Cortex neurons (which project to the BG and Thalamus, and
from which movement is determined). In particular, at each time
step: $y \leftarrow [y + y\,N(0, \sigma)]_0^1$, where $y$ is the output
activity of a Cortex neuron, $N(0, \sigma)$ refers to a number drawn
randomly from a zero-mean Gaussian distribution with standard
deviation $\sigma$, and $[x]_0^1$ returns 0 if $x < 0$, 1 if $x > 1$,
and $x$ otherwise. The proportions of the last 30 movements of all
runs under a particular noise condition that were made to G_1 were
0.82, 0.64, 0.53, and 0.20 for σ levels of 0 (no noise), 0.1, 0.3,
and 0.5, respectively. Thus, the model was able to learn to
repeatedly hit G_1 if a low to moderate level of noise was added to
Cortex neuron activity, but performance dropped off with high levels
of noise. Figure 6 illustrates, in a manner similar to Figure 3,
example model neuron activity for a model run with σ = 0.1. The rest
of the simulations in this paper were run with no noise.
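
As an illustration of the noise rule described above (each Cortex output is perturbed by Gaussian noise scaled by the output itself, then clipped to the range 0 to 1), a minimal Python sketch follows. The function names and the list-of-floats representation of Cortex activity are assumptions made for this example; the rest of the model dynamics are not reproduced.

```python
import random


def clip01(x: float) -> float:
    """Return 0 if x < 0, 1 if x > 1, and x otherwise."""
    return 0.0 if x < 0.0 else 1.0 if x > 1.0 else x


def add_signal_dependent_noise(y, sigma, rng=random):
    """Apply one time step of signal-dependent noise to Cortex outputs.

    y     : list of output activities, one per Cortex neuron (assumed layout)
    sigma : standard deviation of the zero-mean Gaussian
    rng   : the `random` module or a `random.Random` instance
    """
    # The noise term is multiplied by the activity itself, so silent
    # neurons (y = 0) receive no noise and active neurons receive more.
    return [clip01(v + v * rng.gauss(0.0, sigma)) for v in y]
```

Passing a seeded `random.Random` instance as `rng` makes a noisy run repeatable, which is convenient when comparing noise levels such as 0.1, 0.3, and 0.5.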

3.2. EXPERIMENT 2: TWO SIMULTANEOUS TARGETS (G_1 AND G_2FAR)

Movements that hit either of two targets, G_1 (lower right of the
workspace) or G_2far (upper left) (red and blue circles, respectively,
in Figure 7), were reinforced according to Equation 1. However,
the habituation term differentiated them. (The habituation term
is $\rho^{N_k - 1}$ in Equation 1, where $N_k$ is the number of times
target k has been hit and $\rho = 0.825$.) For example, even if G_1 was hit
many times, at the first time G_2far was hit, it was a novel event and
thus the corresponding weights increased by a large amount.
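
To make the effect of habituation concrete, the sketch below evaluates the habituation factor for a target that has been hit N times, assuming the factor has the form rho^(N - 1) with rho = 0.825 as stated above. Only this scaling factor is computed; the remaining terms of Equation 1 are not reproduced here, and the function name is hypothetical.

```python
RHO = 0.825  # value of rho reported in the text


def habituation(n_hits: int, rho: float = RHO) -> float:
    """Habituation factor rho**(n_hits - 1) for a target hit n_hits times.

    Assumed form; it equals 1 at the first hit (a novel event) and
    decays geometrically as the same target is hit again."""
    if n_hits < 1:
        raise ValueError("n_hits counts hits so far and must be at least 1")
    return rho ** (n_hits - 1)


if __name__ == "__main__":
    # A much-practiced target contributes little reinforcement, while the
    # first hit of a new target is scaled by the full factor of 1.
    for n in (1, 2, 5, 10, 20):
        print(n, habituation(n))
```

Because each target keeps its own hit count, a first hit on a new target is scaled by 1 regardless of how often another target has already been hit.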
FIGURE 6 | Example neural activity for selected neurons from D1, SNr,
and Cortex (M) at different points in training for a model with low levels
of noise (σ = 0.1) added to Cortex (M) neuron activity (see text for
details). This figure is plotted in a manner similar to that of Figure 3. Neurons
are colored according to their spatial location in the grid (top left). The red
neuron corresponds to the reinforced movement that hit target G_1 in this
example. Note, however, that, unlike with Figure 3, the reinforced movement
is one unit away from the center of G_1 (the center of G_1 is marked with a
closed red circle). (Recall that the radius of the target is 1.1 units, so
movements made to the center of G_1 or to the immediate neighbors of the
center are reinforced.) The green neuron corresponds to the location of G_exp
(focus of excitation in Explorer), which is not within the target area.

FIGURE 7 | Distribution of movements in Experiment 2: two
simultaneous targets. Left: Proportion of runs that were classified as being
biased toward one target (more than half of the movements hit target G_1, red,
or G_2far, blue); both targets (more than a quarter of the movements hit G_1 and
more than a quarter hit G_2far, gray); or none (all others, black). Right three
graphs: example movement distributions for runs that were classified as
being biased toward G_1, G_2far, or both. Similar to Figure 4, learning was
stopped (this time at the end of the 300 movements) and then movement
resulting from each possible value of G_exp (represented by the spatial locations
of the dots) was evaluated. Red dots indicate the location of G_exp that resulted
in a movement made to G_1 (red circle); blue dots indicate a movement made
to G_2far (blue circle), and green dots indicate a movement made to G_exp.



Figure 7, left, plots the proportion of runs that were classified 
as either biased toward one of the targets, distributed between 
the two targets, or did not find a target (see figure caption for 
details on the classification criteria). While behavior in a majority 
of the runs was biased to a single target (e.g., middle two graphs 
of Figure 7), the model was capable of distributing movements 
to both targets (e.g., Figure 7, right). For runs that were biased
to just one target, only a G_exp very near the un-preferred target
produced a movement to that target.
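
The classification criteria given in the Figure 7 caption can be written as a small function. This is a sketch under stated assumptions: the function name and arguments are hypothetical, and because a run could in principle satisfy both the "one target" and "both targets" criteria, the order in which the rules are checked (single target first) is an assumption rather than something the caption specifies.

```python
def classify_run(n_g1: int, n_g2far: int, n_total: int) -> str:
    """Classify a run by where its movements landed.

    n_g1, n_g2far : number of movements that hit each target
    n_total       : total number of movements in the run
    """
    p1 = n_g1 / n_total
    p2 = n_g2far / n_total
    if p1 > 0.5:                  # more than half of movements hit G1
        return "G1"
    if p2 > 0.5:                  # more than half of movements hit G2far
        return "G2far"
    if p1 > 0.25 and p2 > 0.25:   # more than a quarter hit each target
        return "both"
    return "none"                 # all other runs
```

For a 300-movement run, for example, 100 hits on each target would be classified as "both", while 50 hits on each would be classified as "none".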

3.3. EXPERIMENT 3: REINFORCE G_1, THEN G_2FAR, THEN G_1 AGAIN

The use of experience-based learning rules (weight modification
in Equation 1 depends on actual behavior) and a habituation
term leads to a type of memory that can influence subsequent
behavior in a changing environment. This is illustrated with
experiments in which only movements to G_1 are reinforced for
300 movements, then only movements to G_2far are reinforced (at
which point the habituation term for G_1 is reset), and then only
movements to G_1 are reinforced again. As shown in Figure 8,
top row, which plots the proportion of runs that hit each target
as a function of movement number, the reacquisition of G_1
(movements 601-900) occurred faster than the initial acquisition
(movements 1-300) of G_1.

The enhanced acquisition occurs because corticostriatal weights
corresponding to movements toward G_1, illustrated in red in
Figure 8, bottom row, increased to a stable value (of about 0.2
in the figure) during first acquisition. (The habituation term
prevents them from increasing any more after the target has been
repeatedly hit.) During movements 301-600, G_2far was reinforced
(and G_1 was no longer reinforced). The model continued
to move to G_1 early in the second set of movements, but,
because G_1 was no longer reinforced, the corresponding weights
decreased. As the weights decreased, the bias zone around G_1
decreased and the model was free to move to other locations,
including toward G_2far. As a new bias zone, now centered on
G_2far, was established, the model stopped moving to G_1. Because
movements toward G_1 were no longer made, weights associated
with moving to G_1 ceased to decrease. When movements
to G_1 were reinforced again, those weights were already above
zero and thus G_1 was reacquired faster than it was initially
acquired. In addition, due to resetting the habituation term,
the weights increased to a greater value than the previous high
value.

FIGURE 8 | Time-course of behavior and corticostriatal weights for
Experiment 3: reinforce G_1 (movements 1-300), then G_2far (301-600),
then G_1 again (601-900). Top: proportion of runs that hit G_1 (red) or G_2far
(blue) as a function of movement number. The proportion of runs that hit G_1
during movements 1-300 is redrawn at horizontal positions 601-900 as a
gray line for comparison of performance between initial acquisition
(movements 1-300) and reacquisition (movements 601-900) of G_1. Bottom:
Mean (across runs) weight from Context neuron to the D1 neuron
corresponding to most movements that hit G_1 (red) or G_2far (blue) for that
particular run. The D1 neuron that corresponded to most movements that hit
each target was determined by finding the maximum weight from Context to
D1 neurons at the end of each 300 movement segment. Because several
movements can hit each target, only runs in which the same D1 neuron was
selected at movement 300 and movement 900 (i.e., for movements that hit
G_1) were included (16 out of 50 runs were excluded). Weights from
Context neuron to D2 neurons followed a similar pattern and are not plotted.
Similar to the graphs in the top row, mean weights during movements 1-300
are plotted again at movements 601-900 in gray for comparison purposes.

FIGURE 9 | Behavior for Experiment 4: reinforce G_1 (movements 1-300)
and then either G_2far or G_2near (301-600). Left: locations of the three
targets in the workspace (dots indicate locations corresponding to possible
values of G_exp, colored gray if those locations do not lie within a target area).
Right: proportion of runs that hit G_1 for movements 1-300 or G_2far or G_2near
for movements 301-600.

This pattern of activity provides a simple mechanism that can 
be used to partially explain the findings that practice sessions that 
are separated in time lead to enhanced acquisition and perfor- 
mance compared to practice sessions that are massed together 
(Ammons, 1950; Baddeley and Longman, 1978) (though such 
effects do not necessarily apply to all types of tasks, e.g., Lee and 
Genovese 1989). 

3.4. EXPERIMENT 4: REINFORCE G_1, THEN EITHER G_2FAR OR G_2NEAR

When one target is reinforced for a period of time, and then
another is reinforced instead, how well the second reinforced
target is acquired depends on its proximity to the first target.
This is illustrated by comparing the results of experiments
in which the second target (G_2far, blue in Figure 9)
was far from the first one with those in which the second
target (G_2near, purple) was near the first one. Figure 9 plots
the proportion of runs for which the first and second targets
were hit as a function of movement for the different sets.
The first target (G_1, red) was acquired the fastest. The far
second target (G_2far) was acquired faster than the near second
target (G_2near).

The discrepancy between acquiring the second targets is
explained by the bias zone. A well-learned model has corticostriatal
weights such that the bias zone is large. When the bias
zone is centered around G_1, un-reinforced movements to G_1
must happen in order for weights to decrease, after which the
bias zone shrinks and movements to other locations can be made.
Movements to locations far from G_1 are available earlier than
movements to locations near G_1 as the bias zone shrinks. Thus,
a second target far from G_1 will be more easily acquired than a
second target near G_1.

3.5. EXPERIMENT 5: MOVEMENT REDISTRIBUTION UNDER DIFFERENT
BIAS CONDITIONS

As movements made to a target increase, movements made 
to other locations must decrease: movements are redistributed 
over the workspace. The previous sections focused on move- 
ment redistribution in our model with only BG-mediated biasing 
(Equation 1). Here we describe metrics of movement redis- 
tribution that will allow us to compare how movements are 
redistributed under different bias conditions. We focus on model
runs in which only movements made to one target (G_1) were
reinforced.

Redistribution metric 

The expanding bias zone (Results section 3.1 and Figure 4)
produced by BG-mediated biasing leads to a pattern of behavior
such that movements made near, but not at, the target decrease
in likelihood earlier than movements made far from the target.
For each run, we quantify the rate of decrease as a function
of distance from target. Briefly (see Figure 10), movements that
did not hit the target were coarsely categorized into three
temporal chunks and three spatial zones (vertical and horizontal
lines, respectively, in Figure 10). Temporal chunk one includes
the first 100 movements; temporal chunk two includes the second
100 movements; and temporal chunk three includes the last 100
movements. Recalling that θ_G is target radius and letting dX be the
distance of a movement from target center, the spatial zones are 1)
θ_G < dX < 5θ_G (green points in Figure 10), 2) 5θ_G < dX < 9θ_G
(blue), and 3) 9θ_G < dX (black). The number of movements that
fell into spatial zone i from temporal chunks 1 to 2 to 3 was fit to
an equation of the form $e^{b_i (j - 1)}$, where j refers to temporal
chunk. The rate of decrease of the number of movements was quantified
by the parameter b_i. A more negative b_i indicates a greater rate of
decrease (see the Supplementary section for more details).
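
The binning and fitting steps described above can be sketched as follows. The exact fitting procedure is given in the Supplementary section and is not reproduced here; this example makes several labeled assumptions: counts are normalized by the count in temporal chunk 1, the exponent b is estimated by least squares in log space, a floor of 0.5 replaces empty chunks to avoid log(0), and a distance falling exactly on a zone border is assigned to the outer zone. All names are hypothetical.

```python
import math
from typing import List, Optional, Sequence


def zone_of(distance: float) -> Optional[int]:
    """Spatial zone index (0, 1, 2) for a distance from target center given
    in units of target radius; None for movements that hit the target."""
    if distance <= 1.0:
        return None
    if distance < 5.0:
        return 0
    if distance < 9.0:
        return 1
    return 2


def count_misses(distances: Sequence[float], chunk_len: int = 100,
                 n_chunks: int = 3) -> List[List[int]]:
    """counts[zone][chunk] = number of missed movements in that zone
    during that temporal chunk."""
    counts = [[0] * n_chunks for _ in range(3)]
    for k, d in enumerate(distances[: chunk_len * n_chunks]):
        z = zone_of(d)
        if z is not None:
            counts[z][k // chunk_len] += 1
    return counts


def fit_rate(chunk_counts: Sequence[int]) -> Optional[float]:
    """Estimate b in n_j / n_1 ~ exp(b * (j - 1)), j = 1, 2, 3, by least
    squares on log ratios (assumed procedure). Returns None if chunk 1 is
    empty, because the normalization is then undefined."""
    n1 = chunk_counts[0]
    if n1 == 0:
        return None
    num = den = 0.0
    for x, n in enumerate(chunk_counts[1:], start=1):  # x = j - 1
        ratio = max(n, 0.5) / n1  # 0.5 floor avoids log(0); an assumption
        num += x * math.log(ratio)
        den += x * x
    return num / den
```

With this sketch, a zone whose miss count halves from one chunk to the next yields b = -ln 2, and faster decay gives a more negative b, matching the interpretation of b in the text.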

Movement redistribution across different bias conditions 

Figure 10, top row, graphs movement distance from target as a
function of movement number for three sample runs under
BG-mediated bias (these graphs are similar to Figure 5, right). In
all three cases, b_1 < b_2 < b_3, i.e., the rate of decrease of
movements made near but not at the target is greater than that of
movements made far from the target. This is in line with the
behavioral pattern we would expect given the expanding bias zone
(Figure 4) that results from BG-mediated biasing. Regarding the
specific sample runs in Figure 10, top row, the rate of decrease of
movements from the first sample run that fell within zone one
is greater than that of the second sample run, which is greater
than that of the third sample run. This, also, is reflected in the b
metrics.

The same process was used to determine b metrics for models
that biased movement selection with different mechanisms. Recall
from Methods section 2.5 that, if there is no "Cognitive bias,"
movements suggested by the Explorer (G_exp) were randomly
selected from a uniform distribution over all possible movements.
Under the Cognitive bias scheme (described in Methods
section 2.5 and the Supplementary section), every time the target
is hit, the set of possible movements from which G_exp is
selected decreases: movements further from target center are
removed from the set earlier than movements closer to target
center. Movement redistribution under a Cognitive bias thus follows
a trend opposite that under BG-mediated bias: b_1 > b_2 > b_3
(Figure 10, bottom row).

For a given run of a model using BG-mediated bias, b for spatial
zones closer to the target should be more negative than b
for zones farther from the target. Thus, we expect b_3 - b_2 > 0
and b_2 - b_1 > 0 in models using BG-mediated bias. Models using
the Cognitive bias should exhibit opposite behavior: b_3 - b_2 < 0
and b_2 - b_1 < 0. The differences should be zero if the transition
from variation to repetition does not follow a structured pattern
(i.e., the frequency of movements to non-target areas decreases
uniformly).

Figure 11 plots the distribution of pair-wise (by run) differences
b_3 - b_2 (right column, black) and b_2 - b_1 (left column,
blue) of model runs using different bias conditions (arranged by
row). The means of the distributions were also tested against the
null hypothesis that they are zero (single sample one-tailed
t-tests). The distributions of the pair-wise differences for models
using a BG bias (top row) were positive; those for models using a
Cognitive bias (bottom) were negative; and those for models using
both biasing mechanisms (middle) were also negative (though visual
inspection suggests that the Cognitive bias condition has more
extreme negative pair-wise differences than does the combined
bias condition). Thus, this analysis was able to capture the general
trends that were seen in the different bias conditions of the model.
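
As a rough illustration of the pair-wise analysis above, the following sketch computes the per-run differences of b parameters and the one-sample t statistic of their mean against zero. It uses only the Python standard library and stops at the t statistic and degrees of freedom; obtaining a one-tailed p value additionally requires the CDF of Student's t distribution, which is not included here. The function names and the tuple layout (b for zones 1, 2, 3) are assumptions, and the numbers in the usage note are made up for illustration, not results from the model.

```python
import math
import statistics
from typing import List, Sequence, Tuple


def pairwise_differences(
        b_per_run: Sequence[Tuple[float, float, float]]
) -> Tuple[List[float], List[float]]:
    """For each run's (b1, b2, b3), return the lists of b2 - b1 and b3 - b2."""
    d21 = [b2 - b1 for b1, b2, _ in b_per_run]
    d32 = [b3 - b2 for _, b2, b3 in b_per_run]
    return d21, d32


def one_sample_t(differences: Sequence[float]) -> Tuple[float, int]:
    """t statistic and degrees of freedom for H0: mean difference = 0."""
    n = len(differences)
    sd = statistics.stdev(differences)  # sample standard deviation
    if sd == 0.0:
        raise ValueError("t statistic is undefined when all values are equal")
    t = statistics.fmean(differences) / (sd / math.sqrt(n))
    return t, n - 1
```

For example, with hypothetical runs `[(-1.5, -0.8, -0.3), (-1.2, -0.9, -0.5), (-2.0, -1.0, -0.4)]`, both difference lists are positive and the t statistic is positive, which is the sign pattern expected under BG-mediated bias; the opposite sign would be expected under the Cognitive bias.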

4. DISCUSSION 

As described in a recent theory of action discovery (Redgrave and 
Gurney, 2006; Redgrave et al., 2008, 2011, 2013; Gurney et al., 
2013), when an unexpected sensory event occurs, animals tran- 
sition from executing a variety of movements to repeating move- 
ments that may have caused the event. A transition from variation 
to repetition often follows non-random, structured patterns that 
may be explained with sophisticated cognitive mechanisms (e.g., 
Dearden et al. 1998; Dimitrakakis 2006; Simsek and Barto 2006). 
However, in action discovery, simple non-cognitive mechanisms 
involving dopamine modulation of basal ganglia (BG) activity 
are thought to play a prominent role in behavioral biasing. In 
this paper we use a biologically-plausible computational model 
to demonstrate that a structured transition from variation to 
repetition can emerge from processing within such simple mech- 
anisms. Such behavior is due to the following features on which 
our model, unlike most previous models of BG function, focuses: 
(i) the BG does not bias behavior directly, but modulates cor- 
tical response to excitation (Chevalier and Deniau, 1990; Mink, 
1996; Humphries and Gurney, 2002; Cohen and Frank, 2009; 
Redgrave et al., 2011; Baldassarre et al., 2013); (ii) excitation to
cortex follows a pattern that evolves from weakly exciting all neurons
to strongly exciting only one neuron (Britten et al., 1992;
Platt and Glimcher, 1999; Huk and Shadlen, 2005; Bogacz et al.,
2006; Gold and Shadlen, 2007; Lepora et al., 2012). By including
these features in our model, we show that sophisticated cognitive 
mechanisms may not always be necessary to develop a structured 
transition from variation to repetition. 

In our model, movements occur by selecting an end-point 
(spatial location) to which to move. Movements that terminated 



www.frontiersin.org 



February 2014 | Volume 5 | Article 91 | 11 



Shah and Gurney 



BG-mediated transition from variation to repetition 



FIGURE 10 | Distance from G_1 of executed movement (in units of θ_G) as
a function of movement number (similar to Figure 5, right) for sample
runs of models using BG-mediated biasing (top row) or Cognitive
biasing (bottom). Color of dot indicates the spatial zone (defined relative to
target center) in which the movement lies. Horizontal lines indicate spatial
zone borders. Vertical lines indicate temporal chunk borders. The parameter b
indicates the rate of decrease of movements falling within each spatial zone.
The more negative b is, the greater the rate of decrease (see text for details).



in a target area were reinforced so that the selection of such 
end-points increased in frequency. The transition from executing 
a variety of movements to executing just the reinforced move- 
ments followed a structured pattern: as end-points at the target 
location increased in frequency, end-points near, but not at, the 
target location decreased in frequency at a greater rate than end- 
points far from the target. We refer to the area around the target 
area in which end-point frequency decreased as a "bias zone" 
(Figures 4, 10, top), and the bias zone increased in size as the tar- 
get was repeatedly hit. The graded shift from variation (a small 
bias zone) to repetition (a large bias zone) allows for the discov- 
ery of a second target area in some cases (Figure 7), and also 
results in specific patterns of behavior if the target area moves 
(Figures 8, 9). 

In addition, in action discovery, phasic DA activity in response
to achievement of the outcome (e.g., hitting the reinforced target
area) decreases as associative brain areas learn to predict the
outcome's occurrence (Redgrave and Gurney, 2006; Redgrave et al.,
2008, 2011, 2013; Gurney et al., 2013; Mirolli et al., 2013). This
may be thought of as a type of intrinsic motivation (IM) in that
the outcome need not have hedonic value in order to be reinforcing
(Oudeyer and Kaplan, 2007; Baldassarre, 2011; Barto,
2013; Barto et al., 2013; Gottlieb et al., 2013; Gurney et al.,
2013). The type of IM in action discovery is best described as
some combination of novelty and surprise (Barto et al., 2013).
A detailed account of exactly how the prediction process may
be implemented in the brain is beyond the scope of this paper.
We mimic its effects in our model with a simple habituation
mechanism similar to that used in neural network models of
novelty detection (Marsland, 2009). Here, the reinforcing effects of
an outcome with which the model has little recent experience
are greater than the reinforcing effects of an outcome with which
the model has much recent experience. The habituation term
($\rho^{N_k - 1}$ in Equation 1) influences behavioral patterns,
particularly in tasks in which more than one target area is reinforced
(Figure 7) or the target area changes (Figures 8, 9). Unlike the
reward prediction error hypothesis of phasic DA neuron activity
(Houk et al., 1995; Schultz et al., 1997), habituation is a mechanism
that does not rely on extrinsic motivation by which phasic
DA neuron activity, and hence rate of change of the rate of
corticostriatal plasticity, decreases with continued occurrences of the
outcome.

We also implement models in which a structured transition 
from variation to repetition is that which would be expected if 
one type of more sophisticated mechanism ("Cognitive biasing") 
is in effect. The pattern of behavior (Figure 10, bottom) is then 
different than that of BG-only biasing. Finally, we have devised a 
method for capturing such differences with quantitative measures 
(Figures 10, 11) which will allow us to make contact with future 
behavioral experiments investigating how different brain areas 
contribute to biasing behavior in tasks similar to model tasks. In 
continuing work, we are devising such behavioral experiments. 
Preliminary results suggest that our quantitative measure will 
allow us to compare the effects of different biasing mechanisms 
by examining behavior from different systems (e.g., model versus
human), different workspaces, different target sizes, and different
target locations. Possible ways to isolate different brain mechanisms
include explicit instructions, use of different stimuli (Thirkettle
et al., 2013b), or use of distractor tasks (Stocco et al., 2009).

FIGURE 11 | Comparing redistribution patterns in Experiment 5:
movement redistribution under different bias conditions. Each graph is
a histogram illustrating the distribution of pair-wise differences (by run) of b
parameters of model runs using different bias conditions (arranged by row).
That for b3 − b2 is in the right column (black); that for b2 − b1 is in the left
column (blue). Bin widths and locations were determined as follows: the
minimum and maximum b of all possible b from all analyzed runs from all
conditions defined the range of possible b. This range was divided into 20
evenly-spaced bins of uniform size. The means of the samples of b
parameters were tested according to the hypotheses that they have a
mean μ > 0 or μ < 0 (indicated, along with p value, in each graph).
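
As a rough illustration of the binning procedure described in the Figure 11 caption, the following Python sketch computes pair-wise differences of b parameters by run and counts them in 20 evenly spaced bins. The function names and the sample values are hypothetical and are not taken from the model. Whether the binned range is defined by the b values themselves or by their differences is an assumption left to the caller: the sketch simply takes the range-defining values as an argument.

```python
def pairwise_differences(b_later, b_earlier):
    """Difference of b parameters, paired by run (same index = same run)."""
    if len(b_later) != len(b_earlier):
        raise ValueError("each run must contribute one value to each list")
    return [later - earlier for later, earlier in zip(b_later, b_earlier)]


def make_bin_edges(range_values, n_bins=20):
    """Return n_bins + 1 evenly spaced edges spanning min..max of range_values."""
    lo, hi = min(range_values), max(range_values)
    if hi == lo:
        raise ValueError("range_values must span a non-zero range")
    width = (hi - lo) / n_bins
    return [lo + i * width for i in range(n_bins + 1)]


def histogram(values, edges):
    """Count values per bin; the last bin includes its right edge."""
    n_bins = len(edges) - 1
    lo, hi = edges[0], edges[-1]
    counts = [0] * n_bins
    for v in values:
        if v < lo or v > hi:
            continue  # value lies outside the binned range
        idx = min(int((v - lo) / (hi - lo) * n_bins), n_bins - 1)
        counts[idx] += 1
    return counts


# Hypothetical b parameters for four runs (illustrative values only).
b1 = [0.10, 0.20, 0.15, 0.30]
b2 = [0.25, 0.35, 0.20, 0.50]
diffs = pairwise_differences(b2, b1)  # b2 - b1, by run
edges = make_bin_edges(diffs)         # assumption: range taken from the differences
counts = histogram(diffs, edges)
```

The statistical test on the sample mean is omitted here because the caption does not name the test that was used.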

As with any computational model of brain systems, the mech- 
anisms described in this paper should be viewed as being a part 
of a complex system of interacting parts. We've isolated the effects 
of the specific mechanisms we've investigated in order to demon- 
strate how a structured transition from variation to repetition can 
emerge from those mechanisms. In the next subsection we discuss 
the implications of some of these choices in greater detail and how 
to expand on them to include more sophisticated systems. 



4.1. A MULTI-STAGE SELECTION PROCESS 

Recall that, for each movement in our model, the pattern of excitation
from "Explorer" to "Cortex" evolves from weakly exciting
all neurons to strongly exciting one neuron (referred to as G_exp,
the focus of excitation). The weak excitation of all neurons early
in the evolution allows for corticostriatal plasticity to bias behavior.
Behavior can also be biased by the choice of G_exp, the effects of
which are greater later in the evolution. Thus, the evolving excitation
pattern from Explorer to Cortex allows for a multi-stage
selection process. We expand on these points below.
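
To make the evolving excitation pattern concrete, the sketch below shows one way such a pattern could be generated: every neuron starts with the same weak excitation, and excitation then shifts toward the focus neuron. The linear blend, the parameter values, and the function name are assumptions made for illustration; they are not the model's actual equations, and the topographic spread of excitation around the focus is left out for brevity.

```python
def excitation_pattern(n_neurons, focus, progress, weak=0.1, strong=1.0):
    """Excitation to each neuron at a given stage of the evolution.

    progress = 0.0 gives weak, uniform excitation of all neurons;
    progress = 1.0 gives strong excitation of the focus neuron only.
    The linear blend between the two is an illustrative assumption.
    """
    if not 0.0 <= progress <= 1.0:
        raise ValueError("progress must lie in [0, 1]")
    pattern = []
    for i in range(n_neurons):
        uniform_part = (1.0 - progress) * weak
        focused_part = progress * strong if i == focus else 0.0
        pattern.append(uniform_part + focused_part)
    return pattern


early = excitation_pattern(5, focus=2, progress=0.0)  # all neurons weakly excited
late = excitation_pattern(5, focus=2, progress=1.0)   # only the focus is excited
```

In this sketch, biasing that acts on the early, uniform stage can favor any neuron, whereas the choice of focus dominates the late stage.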

Through corticostriatal plasticity and BG selection mecha- 
nisms, Cortex neurons that are only weakly excited during the 
early stages of excitation from Explorer can increase in activ- 
ity at a greater rate than other Cortex neurons. BG selection 
mechanisms also enable these neurons to suppress the responses 
of other Cortex neurons to subsequent strong excitation (e.g., 
Figure 3). The expanding bias zone (described in Results sec- 
tion 3.1 and Figure 4) that is seen in models using BG-mediated 
biasing emerges from the pattern of excitation from Explorer to 
Cortex. Because the model task was a spatial reaching task, a 
topographic representation was used that revealed an apparent 
dependency between movements: neurons in Explorer near the 
focus of excitation (G_exp) were excited more than neurons far
from the focus. 

However, a different pattern may be revealed in other types of 
tasks. In general, the pattern of activity is likely to be influenced 
by perceptual processing of sensory information. For example, 
the theory of affordances (Gibson, 1977, 1986) suggests that the 
perception of objects preferentially primes neurons that corre- 
spond to actions that can operate on those objects, e.g., the 
perception of a mug would prime a grasping action. Thus, 
the pattern of excitation in these conditions would preferentially
excite those neurons, and excitation may follow a pattern
that is different from the one used in this paper. Because
BG modulates how Cortex responds to excitation rather than
directly exciting movements, any behavioral pattern controlled
by BG-mediated biasing would depend on the pattern of exci- 
tation to Cortex. Thus, different patterns of exploration, and 
different patterns of a structured transition from variation to 
repetition, would be observed in different environments and 
tasks. 

We envision that more sophisticated mechanisms (e.g., our
Cognitive biasing) can be expressed in our model in the later
part of the evolving excitation pattern of the Explorer, i.e., in how
G_exp is chosen. One such mechanism may search the workspace
in a way that is more intelligent than random, such as a spiral or
raster-like search pattern that does not repeat itself until all possible
movements have been executed. The choice of G_exp could also
be adaptive: the transition from variation to repetition could be
governed by measures of optimality, uncertainty, or other
task-related variables (Dearden et al., 1998; Daw et al., 2006;
Dimitrakakis, 2006; Simsek and Barto, 2006; Cohen et al., 2007).
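
As a simple illustration of a raster-like search that does not repeat itself until all possible movements have been executed, the following sketch enumerates the locations of a grid, reversing direction on alternate rows. Treating the workspace as a rectangular grid, and the function name, are assumptions made for this example only.

```python
def raster_order(n_rows, n_cols):
    """Yield each (row, col) grid location exactly once, sweeping row by row
    and reversing direction on alternate rows."""
    for row in range(n_rows):
        if row % 2 == 0:
            cols = range(n_cols)               # left to right
        else:
            cols = range(n_cols - 1, -1, -1)   # right to left
        for col in cols:
            yield (row, col)


# Hypothetical 3-by-4 workspace: every location is visited once before any repeat.
visited = list(raster_order(3, 4))
```

Each successive value could serve as the focus of excitation for one movement, so that variation is exhaustive rather than random.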

Thus, the early part of the evolving excitation pattern from 
Explorer to Cortex comprises weak excitation that is influenced by 
perception of the environment (e.g., affordances or, in our model, 
possible movement locations) or simple mechanisms. The later 
part of the evolution allows for more complicated mechanisms
that may require more processing time to also influence behavior. 
We have focused mostly on simple mechanisms in this paper, but 
the evolving pattern of excitation can be used to implement pro- 
posed theories that focus on multiple influences on behavior, e.g., 
Kawato (1990); Rosenstein and Barto (2004); Daw et al. (2005); 
Shah and Barto (2009). 

4.2. ACTION DISCOVERY WITH COMPLICATED BEHAVIORS 

There are many types of movements or behaviors that can affect 
the environment, e.g., making a gesture (regardless of spatial 
location), manipulating objects in the environment, or making 
a sequence of movements. In this paper we focused on a simple 
type of action in which the system, able to select a spatial end- 
point of movement, must discover the end-point(s) that delivers 
an outcome. On a more abstract level, this is similar to "n-armed
bandit" problems, in which the system must discover which out
of a set of n actions is followed by the most rewarding consequences
in a one-step decision task (e.g., Sutton and Barto, 1998).
The general process of action discovery (Redgrave and Gurney,
2006; Redgrave et al., 2008, 2011, 2013; Gurney et al., 2013) is also
concerned with discovering the temporal and structural components
of a complex behavior that affects the environment. These
problems are similar to the temporal and structural credit
assignment problems (Minsky, 1961; Sutton, 1984, 1988; Barto,
1985; Sutton and Barto, 1998), which we briefly describe below.
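
For readers unfamiliar with n-armed bandit problems, the sketch below shows a standard textbook approach, epsilon-greedy selection with sample-average estimates. It is not the BG-mediated mechanism of our model; the deterministic payoffs, parameter values, and function name are assumptions chosen to keep the example small.

```python
import random


def discover_best_action(payoffs, n_trials=200, epsilon=0.1, seed=0):
    """Epsilon-greedy search for the action with the highest payoff."""
    rng = random.Random(seed)
    n = len(payoffs)
    estimates = [0.0] * n  # running average payoff per action
    counts = [0] * n       # number of times each action was tried

    def try_action(a):
        reward = payoffs[a]  # deterministic payoff (illustrative assumption)
        counts[a] += 1
        estimates[a] += (reward - estimates[a]) / counts[a]

    for a in range(n):  # try every action once before selecting
        try_action(a)
    for _ in range(n_trials):
        if rng.random() < epsilon:
            a = rng.randrange(n)                              # variation
        else:
            a = max(range(n), key=lambda i: estimates[i])     # repetition
        try_action(a)
    best = max(range(n), key=lambda i: estimates[i])
    return best, counts


best, counts = discover_best_action([0.2, 0.8, 0.5])
```

The exploratory branch plays the role of variation and the greedy branch the role of repetition, although here the balance between them is fixed by epsilon rather than emerging from learning.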

One form of the temporal credit assignment problem is 
exposed in systems in which a series of actions is required in 
order to achieve an outcome, and there is great redundancy: a 
large number of different (but possibly overlapping) sequences 
can achieve the outcome. How does the agent discover the most 
direct sequence, i.e., the sequence that uses the fewest actions? 
This redundancy is often resolved by assigning a cost for each 
executed action and using optimal control methods to achieve 
the goal while also minimizing cost (e.g., Sutton and Barto,
1998). However, optimal control methods, which are designed
to find behavior that minimizes cost according to an arbitrary
cost function, may use mechanisms that are more sophisticated
and complicated than those thought to underlie action discovery.
Recent modeling work (Shah and Gurney, 2011; Chersi et al.,
2013) has shown that a simpler learning rule that does not incorporate
cost per action can discover the most direct sequence of
actions in a redundant system. Such behavior remains stable for 
a period of time, but, if learning is not attenuated, extraneous 
actions are incorporated with extended experience (Shah and 
Gurney, 2011). 

The structural credit assignment problem is exposed when a 
system can execute many actions simultaneously and the out- 
come depends only on the simultaneous execution of a small 
subset of those. When behavior is composed of several compo- 
nents, and the outcome is contingent on only some of those 
components, variation allows the animal to determine which 
components are relevant and to "weed out" the irrelevant compo- 
nents. We have not addressed this problem directly, but previous 
work on the structural credit assignment problem in RL offers 
promising directions (Barto and Sutton, 1981; Barto et al., 1981; 
Barto, 1985; Barto and Anandan, 1985; Gullapalli, 1990). 



4.3. CONCLUSION 

How biasing causes a transition from variation to repetition so as 
to converge on the specific movements that cause an outcome is 
a fundamental problem in the process of action discovery. With 
a simple model of a restricted aspect of action discovery, which 
includes neural processing features not included in most other 
models of BG function, we are able to describe the effects of different
types of behavioral biasing. The results reported in this paper
describe a first step in understanding the processes at work
in general action discovery.